{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training a sentiment analysis model with Chainer\n",
    "\n",
    "In this notebook, we will train a model that will allow us to analyze text for positive or negative sentiment. The model will use a recurrent neural network with long short-term memory blocks to generate word embeddings.\n",
    "\n",
    "The Chainer script runs inside of a Docker container running on SageMaker. For more information about the Chainer container, see the sagemaker-chainer-containers repository and the sagemaker-python-sdk repository:\n",
    "\n",
    "* https://github.com/aws/sagemaker-chainer-containers\n",
    "* https://github.com/aws/sagemaker-python-sdk\n",
    "\n",
    "For more on Chainer, please visit the Chainer repository:\n",
    "\n",
    "* https://github.com/chainer/chainer\n",
    "\n",
    "The code in this notebook is adapted from the [text classification](https://github.com/chainer/chainer/tree/master/examples/text_classification) example in the Chainer repository."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Setup\n",
    "from sagemaker import get_execution_role\n",
    "import sagemaker\n",
    "\n",
    "sagemaker_session = sagemaker.Session()\n",
    "\n",
    "# This role retrieves the SageMaker-compatible role used by this Notebook Instance.\n",
    "role = get_execution_role()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Downloading training and test data\n",
    "\n",
    "We use helper functions provided by `chainer` to download and preprocess the data. We'll be using the [Stanford Sentiment Treebank dataset](https://nlp.stanford.edu/sentiment/), which consists of sentence fragments from movie reviews along with labels indicating whether the sentence has a positive sentiment (1) or negative sentiment (0)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import dataset\n",
    "\n",
    "file_paths = dataset.download_dataset(\"stsa.binary\")\n",
    "\n",
    "new_file_paths = dataset.get_stsa_dataset(file_paths)\n",
    "train, test, vocab = dataset.get_stsa_dataset(file_paths)\n",
    "\n",
    "with open(file_paths[0], 'r') as f:\n",
    "    for i in range(20):\n",
    "        line = f.readline()\n",
    "        print(line)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Uploading the data\n",
    "\n",
    "We save the preprocessed data to the local filesystem, and then use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value `inputs` identifies the S3 location, which we will use when we start the Training Job."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import shutil\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "train_data = [element[0] for element in train]\n",
    "train_labels = [element[1] for element in train]\n",
    "\n",
    "test_data = [element[0] for element in test]\n",
    "test_labels = [element[1] for element in test]\n",
    "\n",
    "\n",
    "try:\n",
    "    os.makedirs('/tmp/data/train_sentiment')\n",
    "    os.makedirs('/tmp/data/test_sentiment')\n",
    "    os.makedirs('/tmp/data/vocab')\n",
    "    np.savez('/tmp/data/train_sentiment/train.npz',data=train_data, labels=train_labels)\n",
    "    np.savez('/tmp/data/test_sentiment/test.npz', data=test_data, labels=test_labels)\n",
    "    np.save('/tmp/data/vocab/vocab.npy', vocab)\n",
    "    train_input = sagemaker_session.upload_data(\n",
    "                      path=os.path.join('/tmp', 'data', 'train_sentiment'),\n",
    "                      key_prefix='notebook/chainer_sentiment/train')\n",
    "    test_input = sagemaker_session.upload_data(\n",
    "                     path=os.path.join('/tmp', 'data', 'test_sentiment'),\n",
    "                     key_prefix='notebook/chainer_sentiment/test')\n",
    "    vocab_input = sagemaker_session.upload_data(\n",
    "                      path=os.path.join('/tmp', 'data', 'vocab'),\n",
    "                      key_prefix='notebook/chainer_sentiment/vocab')\n",
    "finally:\n",
    "    shutil.rmtree('/tmp/data')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Writing the Chainer script to run on Amazon SageMaker\n",
    "\n",
    "### Training\n",
    "\n",
    "We need to provide a training script that can run on the SageMaker platform. The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, such as:\n",
    "\n",
    "* `SM_MODEL_DIR`: A string representing the path to the directory to write model artifacts to.\n",
    "  These artifacts are uploaded to S3 for model hosting.\n",
    "* `SM_NUM_GPUS`: An integer representing the number of GPUs available to the host.\n",
    "* `SM_OUTPUT_DIR`: A string representing the filesystem path to write output artifacts to. Output artifacts may\n",
    "  include checkpoints, graphs, and other files to save, not including model artifacts. These artifacts are compressed\n",
    "  and uploaded to S3 to the same S3 prefix as the model artifacts.\n",
    "\n",
    "Supposing two input channels, 'train' and 'test', were used in the call to the Chainer estimator's ``fit()`` method,\n",
    "the following will be set, following the format `SM_CHANNEL_[channel_name]`:\n",
    "\n",
    "* `SM_CHANNEL_TRAIN`: A string representing the path to the directory containing data in the 'train' channel\n",
    "* `SM_CHANNEL_TEST`: Same as above, but for the 'test' channel.\n",
    "\n",
    "A typical training script loads data from the input channels, configures training with hyperparameters, trains a model, and saves a model to `model_dir` so that it can be hosted later. Hyperparameters are passed to your script as arguments and can be retrieved with an `argparse.ArgumentParser` instance. For example, the script run by this notebook starts with the following:\n",
    "\n",
    "```python\n",
    "import argparse\n",
    "import os\n",
    "\n",
    "if __name__=='__main__':\n",
    "        \n",
    "    parser = argparse.ArgumentParser()\n",
    "    \n",
    "    parser.add_argument('--epochs', type=int, default=30)\n",
    "    parser.add_argument('--batch-size', type=int, default=64)\n",
    "    parser.add_argument('--dropout', type=float, default=0.4)\n",
    "    parser.add_argument('--num-layers', type=int, default=1)\n",
    "    parser.add_argument('--num-units', type=int, default=300)\n",
    "    parser.add_argument('--model-type', type=str, default='rnn')\n",
    "\n",
    "    # Data, model, and output directories. These are required.\n",
    "    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])\n",
    "    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])\n",
    "    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])\n",
    "    parser.add_argument('--test', type=str, default=os.environ['SM_CHANNEL_TEST'])\n",
    "    parser.add_argument('--vocab', type=str, default=os.environ['SM_CHANNEL_VOCAB'])\n",
    "    \n",
    "    args, _ = parser.parse_known_args()\n",
    "    \n",
    "    num_gpus = int(os.environ['SM_NUM_GPUS'])\n",
    "    \n",
    "    # ... load from args.train and args.test, train a model, write model to args.model_dir.\n",
    "```\n",
    "\n",
    "Because the Chainer container imports your training script, you should always put your training code in a main guard (`if __name__=='__main__':`) so that the container does not inadvertently run your training code at the wrong point in execution.\n",
    "\n",
    "For more information about training environment variables, please visit https://github.com/aws/sagemaker-containers.\n",
    "\n",
    "### Hosting and Inference\n",
    "\n",
    "We use a single script to train and host the Chainer model. You can also write separate scripts for training and hosting. In contrast with the training script, the hosting script requires you to implement functions with particular function signatures (or rely on defaults for those functions).\n",
    "\n",
    "These functions load your model, deserialize data sent by a client, obtain inferences from your hosted model, and serialize predictions back to a client:\n",
    "\n",
    "\n",
    "* **`model_fn(model_dir)` (always required for hosting)**: This function is invoked to load model artifacts from those written into `model_dir` during training.\n",
    "\n",
    "\n",
    "* `input_fn(input_data, content_type)`: This function is invoked to deserialize prediction data when a prediction request is made. The return value is passed to predict_fn. `input_data` is the serialized input data in the body of the prediction request, and `content_type`, the MIME type of the data.\n",
    "  \n",
    "  \n",
    "* `predict_fn(input_data, model)`: This function accepts the return value of `input_fn` as the `input_data` parameter and the return value of `model_fn` as the `model` parameter and returns inferences obtained from the model.\n",
    "  \n",
    "  \n",
    "* `output_fn(prediction, accept)`: This function is invoked to serialize the return value from `predict_fn`, which is passed in as the `prediction` parameter, back to the SageMaker client in response to prediction requests.\n",
    "\n",
    "\n",
    "`model_fn` is always required, but default implementations exist for the remaining functions. These default implementations can deserialize a NumPy array, invoking the model's `__call__` method on the input data, and serialize a NumPy array back to the client.\n",
    "\n",
    "Please examine the script below. Training occurs behind the main guard, which prevents the function from being run when the script is imported, and `model_fn` loads the model saved into `model_dir` during training.\n",
    "\n",
    "`input_fn` deserializes the input data into a NumPy array from the default data format from the predictor Chainer uses to serialize inference data in the Python SDK (the [NPY format](https://docs.scipy.org/doc/numpy-1.14.0/neps/npy-format.html)). `predict_fn` formats words and converts them into word embeddings, obtains predictions from the model containing the predicted sentiment and returns a NumPy array that `output_fn` serializes to the NPY format back to the client.\n",
    "\n",
    "\n",
    "\n",
    "For more on writing Chainer scripts to run on SageMaker, or for more on the Chainer container itself, please see the following repositories: \n",
    "\n",
    "* For writing Chainer scripts to run on SageMaker: https://github.com/aws/sagemaker-python-sdk\n",
    "* For more on the Chainer container and default hosting functions: https://github.com/aws/sagemaker-chainer-containers\n",
    "\n",
    "The whole script `src/sentiment_analysis.py` is displayed below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "!pygmentize 'src/sentiment_analysis.py'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Running the training script on SageMaker\n",
    "\n",
    "To train a model with a Chainer script, we construct a ```Chainer``` estimator using the [sagemaker-python-sdk](https://github.com/aws/sagemaker-python-sdk). We can pass in an `entry_point`, the name of a script that contains a couple of functions with certain signatures (`train` and `model_fn`). This script will be run on SageMaker in a container that invokes these functions to train and load Chainer models.\n",
    "\n",
    "The ```Chainer``` class allows us to run our training function as a training job on SageMaker infrastructure. We need to configure it with our training script, an IAM role, the number of training instances, and the training instance type. In this case we will run our training job on a `ml.p2.xlarge` instance."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.chainer.estimator import Chainer\n",
    "\n",
    "chainer_estimator = Chainer(entry_point='sentiment_analysis.py',\n",
    "                            source_dir=\"src\",\n",
    "                            role=role,\n",
    "                            sagemaker_session=sagemaker_session,\n",
    "                            train_instance_count=1,\n",
    "                            train_instance_type='ml.p2.xlarge',\n",
    "                            hyperparameters={'epochs': 10, 'batch-size': 64})\n",
    "\n",
    "chainer_estimator.fit({'train': train_input,\n",
    "                       'test': test_input,\n",
    "                       'vocab': vocab_input})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our Chainer script writes various artifacts, such as plots, to a directory `output_data_dir`, the contents of which which SageMaker uploads to S3. Now we download and extract these artifacts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from s3_util import retrieve_output_from_s3\n",
    "\n",
    "chainer_training_job = chainer_estimator.latest_training_job.name\n",
    "\n",
    "desc = sagemaker_session.sagemaker_client. \\\n",
    "           describe_training_job(TrainingJobName=chainer_training_job)\n",
    "output_data = desc['ModelArtifacts']['S3ModelArtifacts'].replace('model.tar.gz', 'output.tar.gz')\n",
    "\n",
    "retrieve_output_from_s3(output_data, 'output/sentiment')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These plots show the accuracy and loss over epochs.\n",
    "\n",
    "In our user script, `sentiment_analysis.py`, at the end of the `train` function. Our model overfits, but we save only the best model for deployment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.display import Image\n",
    "from IPython.display import display\n",
    "\n",
    "accuracy_graph = Image(filename=\"output/sentiment/accuracy.png\",\n",
    "                       width=800,\n",
    "                       height=800)\n",
    "loss_graph = Image(filename=\"output/sentiment/loss.png\",\n",
    "                   width=800,\n",
    "                   height=800)\n",
    "\n",
    "display(accuracy_graph, loss_graph)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Deploying the Trained Model\n",
    "\n",
    "After training, we use the Chainer estimator object to create and deploy a hosted prediction endpoint. We can use a CPU-based instance for inference (in this case an `ml.m4.xlarge`), even though we trained on GPU instances.\n",
    "\n",
    "The predictor object returned by `deploy` lets us call the new endpoint and perform inference on our sample images.\n",
    "\n",
    "At the end of training, `sentiment_analysis.py` saves the trained model, the vocabulary, and a dictionary of model properties that are used to reconstruct the model. These model artifacts are loaded in `model_fn` when the model is hosted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictor = chainer_estimator.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Predicting using SageMaker Endpoint\n",
    "\n",
    "The Chainer predictor converts its input into a NumPy array, which it serializes and sends to the hosted model.\n",
    "The `predict_fn` in `sentiment_analysis.py` receives this NumPy array and uses the loaded model to make predictions on the input data, which it returns as a NumPy array back to the Chainer predictor.\n",
    "\n",
    "We predict against the hosted model on a batch of sentences. The output, as defined by `predict_fn`, consists of the processed input sentence, the prediction, and the score for that prediction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sentences = ['It is fun and easy to train Chainer models on Amazon SageMaker!',\n",
    "             'It used to be slow, difficult, and laborious to train and deploy a model to production.',\n",
    "             'But now it is super fast to deploy to production. And I love it when my model generalizes!',]\n",
    "predictions = predictor.predict(sentences)\n",
    "for prediction in predictions:\n",
    "    sentence, prediction, score = prediction\n",
    "    print('sentence: {}\\nprediction: {}\\nscore: {}\\n'.format(sentence, prediction, score))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now predict against sentences in the test set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open(file_paths[1], 'r') as f:\n",
    "    sentences = f.readlines(2000)\n",
    "    sentences = [sentence[1:].strip() for sentence in sentences]\n",
    "    predictions = predictor.predict(sentences)\n",
    "\n",
    "predictions = predictor.predict(sentences)\n",
    "\n",
    "for prediction in predictions:\n",
    "    sentence, prediction, score = prediction\n",
    "    print('sentence: {}\\nprediction: {}\\nscore: {}\\n'.format(sentence, prediction, score))\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cleanup\n",
    "\n",
    "After you have finished with this example, remember to delete the prediction endpoint to release the instance(s) associated with it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "chainer_estimator.delete_endpoint()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_chainer_p36",
   "language": "python",
   "name": "conda_chainer_p36"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  },
  "notice": "Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.  Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.      amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
